153 research outputs found

    Generating a training corpus for OCR post-correction using encoder-decoder model

    In this paper we present a novel approach to the automatic correction of OCR-induced orthographic errors in a given text. While current systems depend heavily on large training corpora or external information, such as domain-specific lexicons or confidence scores from the OCR process, our system only requires a small amount of relatively clean training data from a representative corpus to learn a character-based statistical language model using Bidirectional Long Short-Term Memory Networks (biLSTMs). We demonstrate the versatility and adaptability of our system on different text corpora with varying degrees of textual noise, including a real-life OCR corpus in the medical domain.
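    As a rough illustration only (not the authors' system; the model sizes, the character vocabulary and the training setup below are assumptions), a character-level biLSTM that reads a noisy OCR character sequence and predicts a corrected character at each position could be sketched in PyTorch as follows.

```python
import torch
import torch.nn as nn

class CharBiLSTMCorrector(nn.Module):
    """Illustrative character-level biLSTM: maps noisy OCR characters to
    corrected characters position by position (a sketch, not the paper's code)."""
    def __init__(self, n_chars: int, emb_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_chars)

    def forward(self, noisy_ids):                   # (batch, seq_len) character ids
        h, _ = self.bilstm(self.embed(noisy_ids))   # contextual states from both directions
        return self.out(h)                          # logits over the character vocabulary

# Toy training step: the target is the clean version of the noisy input sequence.
model = CharBiLSTMCorrector(n_chars=100)
noisy = torch.randint(0, 100, (8, 40))
clean = torch.randint(0, 100, (8, 40))
logits = model(noisy)
loss = nn.functional.cross_entropy(logits.reshape(-1, 100), clean.reshape(-1))
loss.backward()
```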

    NLP Community Perspectives on Replicability.

    With recent efforts in drawing attention to the task of replicating and/or reproducing results, for example in the context of COLING 2018 and various LREC workshops, the question arises of how the NLP community views the topic of replicability in general. Using a survey in which we involve members of the NLP community, we investigate how our community perceives this topic, its relevance and options for improvement. Based on over two hundred participants, the survey results confirm earlier observations that successful reproducibility requires more than having access to code and data. Additionally, the results show that the topic has to be tackled from the authors', reviewers' and community's side.

    Proposal for an Extension of Traditional Named Entities: from Guidelines to Evaluation, an Overview

    Within the framework of the construction of a fact database, we defined guidelines to extract named entities, using a taxonomy based on an extension of the usual named entities definition. We thus defined new types of entities with broader coverage, including substantive-based expressions. These extended named entities are hierarchical (with types and components) and compositional (with recursive type inclusion and metonymy annotation). Human annotators used these guidelines to annotate a 1.3M word broadcast news corpus in French. This article presents the definition and novelty of extended named entity annotation guidelines, the human annotation of a global corpus and of a mini reference corpus, and the evaluation of annotations through the computation of inter-annotator agreement. Finally, we discuss our approach and the computed results, and outline further work.
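    The abstract mentions evaluating the annotation through inter-annotator agreement. One common agreement measure (the abstract does not say which one the authors compute) is Cohen's kappa for two annotators over the same items; a minimal sketch with invented labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items
    (one common agreement measure; shown here only as an example)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(freq_a[label] / n * freq_b[label] / n for label in freq_a)
    return (observed - expected) / (1 - expected)

# Toy example: two annotators assigning entity types to six tokens.
a = ["PER", "ORG", "O", "O", "LOC", "PER"]
b = ["PER", "O",   "O", "O", "LOC", "PER"]
print(round(cohens_kappa(a, b), 3))   # about 0.76
```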

    Structured Named Entities in two distinct press corpora: Contemporary Broadcast News and Old Newspapers

    This paper compares the reference annotation of structured named entities in two corpora with different origins and properties. It addresses two questions linked to such a comparison. On the one hand, what specific issues were raised by reusing the same annotation scheme on a corpus that differs from the first in terms of media and that predates it by more than a century? On the other hand, what contrasts were observed in the resulting annotations across the two corpora?

    Approches à base de fréquences pour la simplification lexicale

    Lexical simplification consists in replacing words or phrases by simpler equivalents. In this paper, we present three models for lexical simplification, based on different criteria that make one word simpler to read and understand than another. We tested different sizes of context around the considered word: no context, with a model based on term frequencies in a simplified-English corpus; a few words of context, with n-gram probabilities derived from Web data; and an extended context, with a model based on co-occurrence frequencies. Keywords: lexical simplification, lexical frequency, language model.
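    To illustrate the simplest of the three variants, the no-context frequency model, one can rank the candidate substitutes of a word by their frequency in a simplified-English corpus and keep the most frequent one. This is a sketch only: the corpus, candidate list and tie-breaking below are invented, not the paper's resources.

```python
from collections import Counter

def simplify_word(word, candidates, simple_corpus_tokens):
    """No-context frequency model (illustrative): among the candidate substitutes,
    return the one that is most frequent in a simplified-English corpus."""
    freq = Counter(simple_corpus_tokens)
    return max(candidates + [word], key=lambda w: freq[w])

# Toy data: 'purchase' is replaced by the more frequent 'buy'.
corpus = "people buy food and people buy clothes and people get things".split()
print(simplify_word("purchase", ["buy", "acquire", "get"], corpus))   # buy
```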

    A corpus for studying full answer justification

    Question answering (QA) systems aim at retrieving precise information from a large collection of documents. To be considered as reliable by users, a QA system must provide elements to evaluate the answer. This notion of answer justification can also be useful when developing a QA system, in order to give criteria for selecting correct answers. An answer justification can be found in a sentence, a passage made of several consecutive sentences, or several passages of a document or several documents. Thus, we are interested in pinpointing the set of information that allows verifying the correctness of the answer in a candidate passage, and the question elements that are missing in this passage. Moreover, the relevant information is often given in texts in a different form from the question form: anaphora, paraphrases, synonyms. In order to have a better idea of the importance of all the phenomena we underlined, and to provide enough examples at the QA developer's disposal to study them, we decided to build an annotated corpus.
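    A crude baseline for the idea of "question elements missing from a candidate passage" is plain lexical matching, far weaker than the phenomena the corpus targets (anaphora, paraphrases, synonyms). The function and the toy question/passage below are invented for illustration only.

```python
def question_coverage(question_terms, passage):
    """Split question elements into those found literally in the passage and
    those missing (simple lexical matching; the corpus described above is
    built precisely to study the harder, non-literal cases)."""
    passage_tokens = set(passage.lower().split())
    present = [t for t in question_terms if t.lower() in passage_tokens]
    missing = [t for t in question_terms if t.lower() not in passage_tokens]
    return present, missing

present, missing = question_coverage(
    ["Curie", "Nobel", "1903"],
    "Marie Curie shared the prize with Pierre Curie and Henri Becquerel.",
)
print(present)   # ['Curie']
print(missing)   # ['Nobel', '1903']
```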